{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {},
    "id": "view-in-github"
   },
   "source": [
    "<a href=\"https://colab.research.google.com/github/NeuromatchAcademy/course-content/blob/master/tutorials/W1D3_ModelFitting/student/W1D3_Tutorial3.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a> &nbsp; <a href=\"https://kaggle.com/kernels/welcome?src=https://raw.githubusercontent.com/NeoNeuron/professional-workshop-3/master/tutorials/W4_ModelFitting/student/W4_TutorialBonus2.ipynb\" target=\"_parent\"><img src=\"https://kaggle.com/static/images/open-in-kaggle.svg\" alt=\"Open in Kaggle\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "\n",
    "# (Bonus) Model Fitting: Confidence intervals and bootstrapping\n",
    "\n",
    "**Original content creators**: Pierre-Étienne Fiquet, Anqi Wu, Alex Hyafil with help from Byron Galbraith\n",
    "\n",
    "**Content modified**: Kai Chen "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "# Tutorial Objectives\n",
    "\n",
    "In this tutorial, we wil discuss how to gauge how good our estimated model parameters are. \n",
    "- Learn how to use bootstrapping to generate new sample datasets\n",
    "- Estimate our model parameter on these new sample datasets\n",
    "- Quantify the variance of our estimate using confidence intervals"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "Up to this point we have been finding ways to estimate model parameters to fit some observed data. Our approach has been to optimize some criterion, either minimize the mean squared error or maximize the likelihood while using the entire dataset. How good is our estimate really? How confident are we that it will generalize to describe new data we haven't seen yet?\n",
    "\n",
    "One solution to this is to just collect more data and check the MSE on this new dataset with the previously estimated parameters. However this is not always feasible and still leaves open the question of how quantifiably confident we are in the accuracy of our model.\n",
    "\n",
    "In Section 1, we will explore how to implement bootstrapping. In Section 2, we will build confidence intervals of our estimates using the bootstrapping method."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "---\n",
    "# Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "both",
    "colab_type": "code",
    "execution": {}
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab_type": "code",
    "execution": {}
   },
   "outputs": [],
   "source": [
    "#@title Figure Settings\n",
    "%config InlineBackend.figure_format = 'retina'\n",
    "\n",
    "nma_style = {\n",
    "    'figure.figsize' : (8, 6),\n",
    "    'figure.autolayout' : True,\n",
    "    'font.size' : 15,\n",
    "    'xtick.labelsize' : 'small',\n",
    "    'ytick.labelsize' : 'small',\n",
    "    'legend.fontsize' : 'small',\n",
    "    'axes.spines.top' : False,\n",
    "    'axes.spines.right' : False,\n",
    "    'xtick.major.size' : 5,\n",
    "    'ytick.major.size' : 5,\n",
    "}\n",
    "for key, value in nma_style.items():\n",
    "    plt.rcParams[key] = value\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab_type": "code",
    "execution": {}
   },
   "outputs": [],
   "source": [
    "#@title Helper Functions\n",
    "def solve_normal_eqn(x, y):\n",
    "  \"\"\"Solve the normal equations to produce the value of theta_hat that minimizes\n",
    "    MSE.\n",
    "\n",
    "    Args:\n",
    "    x (ndarray): An array of shape (samples,) that contains the input values.\n",
    "    y (ndarray): An array of shape (samples,) that contains the corresponding\n",
    "      measurement values to the inputs.\n",
    "    thata_hat (float): An estimate of the slope parameter.\n",
    "\n",
    "  Returns:\n",
    "    float: the value for theta_hat arrived from minimizing MSE\n",
    "  \"\"\"\n",
    "  theta_hat = (x.T @ y) / (x.T @ x)\n",
    "  return theta_hat"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "---\n",
    "# Section 1: Bootstrapping\n",
    "\n",
    "[Bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)) is a widely applicable method to assess confidence/uncertainty about estimated parameters, it was originally [proposed](https://projecteuclid.org/euclid.aos/1176344552) by [Bradley Efron](https://en.wikipedia.org/wiki/Bradley_Efron). The idea is to generate many new synthetic datasets from the initial true dataset by randomly sampling from it, then finding estimators for each one of these new datasets, and finally looking at the distribution of all these estimators to quantify our confidence.\n",
    "\n",
    "Note that each new resampled datasets will be the same size as our original one, with the new data points sampled with replacement i.e. we can repeat the same data point multiple times. Also note that in practice we need a lot of resampled datasets, here we use 2000.\n",
    "\n",
    "To explore this idea, we will start again with our noisy samples along the line $y_n = 1.2x_n + \\epsilon_n$, but this time only use half the data points as last time (15 instead of 30)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab_type": "code",
    "execution": {}
   },
   "outputs": [],
   "source": [
    "#@title\n",
    "\n",
    "#@markdown Execute this cell to simulate some data\n",
    "\n",
    "# setting a fixed seed to our random number generator ensures we will always\n",
    "# get the same psuedorandom number sequence\n",
    "np.random.seed(121)\n",
    "\n",
    "# Let's set some parameters\n",
    "theta = 1.2\n",
    "n_samples = 15\n",
    "\n",
    "# Draw x and then calculate y\n",
    "x = 10 * np.random.rand(n_samples)  # sample from a uniform distribution over [0,10)\n",
    "noise = np.random.randn(n_samples)  # sample from a standard normal distribution\n",
    "y = theta * x + noise\n",
    "\n",
    "fig, ax = plt.subplots()\n",
    "ax.scatter(x, y)  # produces a scatter plot\n",
    "ax.set(xlabel='x', ylabel='y');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "### Exercise 1: Resample Dataset with Replacement\n",
    "\n",
    "In this exercise you will implement a method to resample a dataset with replacement. The method accepts $x$ and $y$ arrays. It should return a new set of $x'$ and $y'$ arrays that are created by randomly sampling from the originals.\n",
    "\n",
    "We will then compare the original dataset to a resampled dataset.\n",
    "\n",
    "TIP: The [numpy.random.choice](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html) method would be useful here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "both",
    "colab_type": "code",
    "execution": {}
   },
   "outputs": [],
   "source": [
    "def resample_with_replacement(x, y):\n",
    "  \"\"\"Resample data points with replacement from the dataset of `x` inputs and\n",
    "  `y` measurements.\n",
    "\n",
    "  Args:\n",
    "    x (ndarray): An array of shape (samples,) that contains the input values.\n",
    "    y (ndarray): An array of shape (samples,) that contains the corresponding\n",
    "      measurement values to the inputs.\n",
    "\n",
    "  Returns:\n",
    "    ndarray, ndarray: The newly resampled `x` and `y` data points.\n",
    "  \"\"\"\n",
    "  #######################################################\n",
    "  ## TODO for students: resample dataset with replacement\n",
    "  # Fill out function and remove\n",
    "  raise NotImplementedError(\"Student exercise: resample dataset with replacement\")\n",
    "  #######################################################\n",
    "\n",
    "  # Get array of indices for resampled points\n",
    "  sample_idx = ...\n",
    "\n",
    "  # Sample from x and y according to sample_idx\n",
    "  x_ = ...\n",
    "  y_ = ...\n",
    "  return x_, y_\n",
    "\n",
    "\n",
    "fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 5))\n",
    "ax1.scatter(x, y)\n",
    "ax1.set(title='Original', xlabel='x', ylabel='y')\n",
    "\n",
    "# Uncomment below to test your function\n",
    "#x_, y_ = resample_with_replacement(x, y)\n",
    "#ax2.scatter(x_, y_, color='c')\n",
    "\n",
    "ax2.set(title='Resampled', xlabel='x', ylabel='y',\n",
    "        xlim=ax1.get_xlim(), ylim=ax1.get_ylim());"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "cellView": "both",
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "[*Click for solution*](https://github.com/NeoNeuron/professional-workshop-3/tree/master//tutorials/W4_ModelFitting/solutions/W4_TutorialBonus2_Solution_fe37e229.py)\n",
    "\n",
    "*Example output:*\n",
    "\n",
    "<img alt='Solution hint' align='left' width=1690.0 height=683.0 src=https://raw.githubusercontent.com/NeoNeuron/professional-workshop-3/master/tutorials/W4_ModelFitting/static/W4_TutorialBonus2_Solution_fe37e229_3.png>\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "In the resampled plot on the right, the actual number of points is the same, but some have been repeated so they only display once.\n",
    "\n",
    "Now that we have a way to resample the data, we can use that in the full bootstrapping process."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "### Exercise 2: Bootstrap Estimates\n",
    "\n",
    "In this exercise you will implement a method to run the bootstrap process of generating a set of $\\hat\\theta$ values from a dataset of $x$ inputs and $y$ measurements. You should use `resample_with_replacement` here, and you may also invoke helper function `solve_normal_eqn` from Tutorial 1 to produce the MSE-based estimator.\n",
    "\n",
    "We will then use this function to look at the theta_hat from different samples.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab_type": "code",
    "execution": {}
   },
   "outputs": [],
   "source": [
    "def bootstrap_estimates(x, y, n=2000):\n",
    "  \"\"\"Generate a set of theta_hat estimates using the bootstrap method.\n",
    "\n",
    "  Args:\n",
    "    x (ndarray): An array of shape (samples,) that contains the input values.\n",
    "    y (ndarray): An array of shape (samples,) that contains the corresponding\n",
    "      measurement values to the inputs.\n",
    "    n (int): The number of estimates to compute\n",
    "\n",
    "  Returns:\n",
    "    ndarray: An array of estimated parameters with size (n,)\n",
    "  \"\"\"\n",
    "  theta_hats = np.zeros(n)\n",
    "\n",
    "  ##############################################################################\n",
    "  ## TODO for students: implement bootstrap estimation\n",
    "  # Fill out function and remove\n",
    "  raise NotImplementedError(\"Student exercise: implement bootstrap estimation\")\n",
    "  ##############################################################################\n",
    "\n",
    "  # Loop over number of estimates\n",
    "  for i in range(n):\n",
    "\n",
    "    # Resample x and y\n",
    "    x_, y_ = ...\n",
    "\n",
    "    # Compute theta_hat for this sample\n",
    "    theta_hats[i] = ...\n",
    "\n",
    "  return theta_hats\n",
    "\n",
    "\n",
    "np.random.seed(123)  # set random seed for checking solutions\n",
    "\n",
    "# Uncomment below to test function\n",
    "# theta_hats = bootstrap_estimates(x, y, n=2000)\n",
    "# print(theta_hats[0:5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "cellView": "both",
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "[*Click for solution*](https://github.com/NeoNeuron/professional-workshop-3/tree/master//tutorials/W4_ModelFitting/solutions/W4_TutorialBonus2_Solution_315a8d02.py)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "You should see `[1.27550888 1.17317819 1.18198819 1.25329255 1.20714664]` as the first five estimates."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "Now that we have our bootstrap estimates, we can visualize all the potential models (models computed with different resampling) together to see how distributed they are."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab_type": "code",
    "execution": {}
   },
   "outputs": [],
   "source": [
    "#@title\n",
    "#@markdown Execute this cell to visualize all potential models\n",
    "\n",
    "fig, ax = plt.subplots()\n",
    "\n",
    "# For each theta_hat, plot model\n",
    "theta_hats = bootstrap_estimates(x, y, n=2000)\n",
    "for i, theta_hat in enumerate(theta_hats):\n",
    "  y_hat = theta_hat * x\n",
    "  ax.plot(x, y_hat, c='r', alpha=0.01, label='Resampled Fits' if i==0 else '')\n",
    "\n",
    "# Plot observed data\n",
    "ax.scatter(x, y, label='Observed')\n",
    "\n",
    "# Plot true fit data\n",
    "y_true = theta * x\n",
    "ax.plot(x, y_true, 'g', linewidth=2, label='True Model')\n",
    "\n",
    "ax.set(\n",
    "  title='Bootstrapped Slope Estimation',\n",
    "  xlabel='x',\n",
    "  ylabel='y'\n",
    ")\n",
    "\n",
    "# Change legend line alpha property\n",
    "handles, labels = ax.get_legend_handles_labels()\n",
    "handles[0].set_alpha(1)\n",
    "\n",
    "ax.legend();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "This looks pretty good! The bootstrapped estimates spread around the true model, as we would have hoped. Note that here we have the luxury to know the ground truth value for $\\theta$, but in applications we are trying to guess it from data. Therefore, assessing  the  quality  of  estimates  based  on  finite data is a task of fundamental importance in data analysis.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "---\n",
    "# Section 2: Confidence Intervals\n",
    "\n",
    "Let us now quantify how uncertain our estimated slope is. We do so by computing [confidence intervals](https://en.wikipedia.org/wiki/Confidence_interval) (CIs) from our bootstrapped estimates. The most direct approach is to compute percentiles from the empirical distribution of bootstrapped estimates. Note that this is widely applicable as we are not assuming that this empirical distribution is Gaussian."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab_type": "code",
    "execution": {}
   },
   "outputs": [],
   "source": [
    "#@title\n",
    "\n",
    "#@markdown Execute this cell to plot bootstrapped CI\n",
    "\n",
    "theta_hats = bootstrap_estimates(x, y, n=2000)\n",
    "print(f\"mean = {np.mean(theta_hats):.2f}, std = {np.std(theta_hats):.2f}\")\n",
    "\n",
    "fig, ax = plt.subplots()\n",
    "ax.hist(theta_hats, bins=20, facecolor='C1', alpha=0.75)\n",
    "ax.axvline(theta, c='g', label=r'True $\\theta$')\n",
    "ax.axvline(np.percentile(theta_hats, 50), color='r', label='Median')\n",
    "ax.axvline(np.percentile(theta_hats, 2.5), color='b', label='95% CI')\n",
    "ax.axvline(np.percentile(theta_hats, 97.5), color='b')\n",
    "ax.legend()\n",
    "ax.set(\n",
    "    title='Bootstrapped Confidence Interval',\n",
    "    xlabel=r'$\\hat{{\\theta}}$',\n",
    "    ylabel='count',\n",
    "    xlim=[1.0, 1.5]\n",
    ");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "Looking at the distribution of bootstrapped $\\hat{\\theta}$ values, we see that the true $\\theta$ falls well within the 95% confidence interval, wich is reinsuring. We also see that the value $\\theta = 1$ does not fall within the confidence interval. From this we would reject the hypothesis that the slope was 1."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "---\n",
    "# Summary\n",
    "\n",
    "- Bootstrapping is a resampling procedure that allows to build confidence intervals around inferred parameter values\n",
    "- it is a widely applicable and very practical method that relies on computational power and pseudo-random number generators (as opposed to more classical approaches than depend on analytical derivations)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "**Suggested readings**  \n",
    "\n",
    "Computer Age Statistical Inference: Algorithms, Evidence and Data Science, by Bradley Efron and Trevor Hastie\n"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "include_colab_link": true,
   "name": "W4_TutorialBonus2",
   "provenance": [],
   "toc_visible": true
  },
  "interpreter": {
   "hash": "9516f62da91337f10c2adbe814d9c63a4b08f8271333386358218606edb781e3"
  },
  "kernel": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "kernelspec": {
   "display_name": "Python 3.7.11 64-bit ('pw3': conda)",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
